The Impact of a Fault Tolerant MPI on Scalable Systems Services and Applications

Authors

  • Richard Graham
  • Joshua Hursey
  • Geoffroy Vallée
  • Thomas Naughton
  • Swen Boehm
Abstract

Exascale-targeted scientific applications must be prepared for a highly concurrent computing environment where failure will be a regular event during execution. Natural and algorithm-based fault tolerance (ABFT) techniques can often manage failures more efficiently than traditional checkpoint/restart techniques alone. Central to many petascale applications is an MPI standard that lacks support for ABFT. The Run-Through Stabilization (RTS) proposal, under consideration for MPI 3, allows an application to continue execution when processes fail. The requirements of scalable, fault tolerant MPI implementations and applications will stress the capabilities of many system services. System services must evolve to efficiently support such applications and libraries in the presence of system component failures. This paper discusses how the RTS proposal impacts system services, highlighting specific requirements. Early experimentation results from Cray systems at ORNL using prototype MPI and runtime implementations are presented. Additionally, this paper outlines fault tolerance techniques targeted at leadership class applications.

Keywords: MPI; Fault Tolerance; Runtime Environment; Algorithm Based Fault Tolerance; Run-through Stabilization

Similar resources

A Fault Tolerance Solution for Sequential and MPI Applications on the Grid

The Grid community has made an important effort in developing middleware to provide different functionalities, such as resource discovery, resource management, job submission or execution monitoring. As part of this effort this paper addresses the design and implementation of an architecture (CPPC-G) based on services to manage the execution of fault tolerant applications on Grids. The CPPC (Co...

Voting Algorithm Based on Adaptive Neuro Fuzzy Inference System for Fault Tolerant Systems

Some applications are critical and must be designed as fault-tolerant systems. A voting algorithm is usually one of the principal elements of a fault-tolerant system. Two kinds of voting algorithm are used in most applications: the majority voting algorithm and the weighted average algorithm. Both have some problems. Majority voting confronts the problem of threshold limits and voter of weight...

Run-Through Stabilization: An MPI Proposal for Process Fault Tolerance

The MPI standard lacks semantics and interfaces for sustained application execution in the presence of process failures. Exascale HPC systems may require scalable, fault resilient MPI applications. The mission of the MPI Forum’s Fault Tolerance Working Group is to enhance the standard to enable the development of scalable, fault tolerant HPC applications. This paper presents an overview of the ...

Scalable Fault Tolerant MPI: Extending the Recovery Algorithm

Fault Tolerant MPI (FT-MPI) [6] was designed to give applications different methods of handling process failures beyond simple checkpoint/restart schemes. The initial implementation of FT-MPI included a robust, heavyweight system state recovery algorithm that was designed to manage the membership of MPI communicators during multiple failures. The algorithm and its implementation alt...

Publication date: 2012